from pandas.api.types import CategoricalDtype
import pandas as pd
import numpy as np
from IPython.display import IFrame
import matplotlib as plt
from sodapy import Socrata
import plotly
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
sns.set(color_codes=True)
pd.options.mode.chained_assignment = None # default='warn'
client = Socrata("data.cityofnewyork.us", app_token="YOUR_APP_TOKEN",  # credentials redacted -- never commit real secrets
                 username="YOUR_USERNAME", password="YOUR_PASSWORD")
output = client.get("fhrw-4uyv", where="created_date >= '2019-09-13'",order="created_date DESC", limit=150000)
# Data Cleaning and prepping of dataframes
df = pd.DataFrame(output)
df['created_date'] = pd.to_datetime(df['created_date'])
df['closed_date'] = pd.to_datetime(df['closed_date'])
trend = df.groupby(df.created_date.dt.floor('d')).count()[['descriptor']]
clean_df = df[['agency','created_date','closed_date','complaint_type','descriptor','latitude','longitude',
'status']]
# Tickets that remain open do not have a closed timestamp, so they must be dropped
clean_df = clean_df[clean_df['closed_date'].notnull()]
clean_df.loc[:,'mins_to_resolve'] = (clean_df['closed_date'] - clean_df['created_date']) / np.timedelta64(1, 'm')
clean_df = clean_df.query("status == 'Closed'")
top_complaints = clean_df.groupby('complaint_type').size().sort_values(ascending=False).head(15)\
.reset_index().rename(columns={0:'count'})
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)
# Parking only dataframe
parking_df = clean_df.query("complaint_type == 'Illegal Parking'")
parking_df.loc[:,'week_day'] = parking_df['created_date'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
clean_parking = parking_df.assign(date = pd.to_datetime(parking_df['created_date'], format='%m/%d/%Y %I:%M:%S %p'))
trend_avg = clean_parking.groupby([clean_parking.date.dt.hour, 'week_day'])\
['mins_to_resolve'].mean().reset_index()\
.rename(columns={'mins_to_resolve':'avg_mins_to_resolve',
'date':'hour'})
trend_median = clean_parking.groupby([clean_parking.date.dt.hour, 'week_day'])\
['mins_to_resolve'].median().reset_index().rename(columns={'mins_to_resolve':'median_mins_to_resolve',
'date':'hour'})
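Two of the prep steps above (converting a timedelta to minutes, and ordering weekdays with a categorical dtype) can be sketched on a tiny hypothetical frame -- the dates below are illustrative, not from the 311 extract:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# Tiny hypothetical frame standing in for the 311 data above
demo = pd.DataFrame({
    "created_date": pd.to_datetime(["2019-09-13 22:30", "2019-09-14 08:00"]),
    "closed_date": pd.to_datetime(["2019-09-14 00:00", "2019-09-14 09:30"]),
})

# Dividing a timedelta Series by np.timedelta64(1, 'm') yields float minutes
demo["mins_to_resolve"] = (demo["closed_date"] - demo["created_date"]) / np.timedelta64(1, "m")

# An ordered categorical makes weekday sorting chronological rather than alphabetical
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
demo["week_day"] = demo["created_date"].dt.day_name().astype(CategoricalDtype(days, ordered=True))
print(demo.sort_values("week_day"))
```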
For context, the dataset I'm working with is 311 requests for parking violations. I'm an avid biker (when healthy), and I recently learned that you can submit a ticket via 311 when a car is parked in a bike lane. This triggers a "parking violation request" that is submitted to the NYPD in real time, which is then required to go to the reported location and determine whether to issue a ticket. I'm interested in NYPD response times to 311 requests: is there a certain day of the week, time of day, or even location within the city where they respond more quickly?
# Measures of central tendency and variation
parking_df['mins_to_resolve'].describe()
The numeric descriptions paint the picture of a highly skewed dataset: with a median roughly half a standard deviation below the mean, we clearly have a positively skewed distribution with an extremely long right tail. Given that we're measuring resolution time in minutes, it's important to watch for data points whose values correspond to weeks or months, because they can greatly distort any insight we draw from the data.
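The visual impression of skew can be backed with numbers directly; the snippet below uses a synthetic lognormal sample as a stand-in for mins_to_resolve (illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for mins_to_resolve
rng = np.random.default_rng(0)
resolve_mins = pd.Series(rng.lognormal(mean=4, sigma=1.2, size=10_000))

# For a positively skewed distribution: mean > median, sample skewness > 0
print(resolve_mins.mean(), resolve_mins.median(), resolve_mins.skew())
```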
Wanting to understand why differences in response time exist, I looked into the day and hour at which each 311 request was created. The "Distribution of Mean Response Times per Day of the Week" chart shows a pattern: the spread between the 75th and 25th percentile response times is larger for Friday, Saturday, and Sunday than for any other day of the week.
The scatter plot below explores one level deeper to reveal any relationship between response time and the hour a request was created. While there doesn't appear to be a correlation between hour created and response time during the work week, there does appear to be a positive relationship on Friday and Saturday.
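The per-day relationship can also be quantified with a Pearson correlation for each panel of the lmplot; this sketch uses synthetic hourly averages (not the real trend_avg) to show the shape of the computation:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for trend_avg: hourly mean response times per weekday
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "week_day": ["Saturday"] * 24 + ["Tuesday"] * 24,
    "hour": np.tile(np.arange(24), 2),
    "avg_mins_to_resolve": np.concatenate([
        60 + 5 * np.arange(24) + rng.normal(0, 10, 24),   # upward trend
        120 + rng.normal(0, 10, 24),                      # flat
    ]),
})

# Pearson r per weekday quantifies what the lmplot panels show visually
r_by_day = demo.groupby("week_day")[["hour", "avg_mins_to_resolve"]].apply(
    lambda g: g["hour"].corr(g["avg_mins_to_resolve"]))
print(r_by_day)
```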
fig = px.histogram(clean_df, x="mins_to_resolve", nbins=50)
fig.update_layout(title_text = "Distribution of Resolution Time for 311 Parking Violations")
fig.show()
temp = parking_df.groupby(parking_df['created_date'].dt.day_name()).size()\
.reset_index().rename(columns={'created_date':'week_day',0:'count'})
temp.week_day = temp.week_day.astype(cat_type)
temp.sort_values(by='week_day', inplace=True)
fig = go.Figure([go.Bar(x=temp.week_day, y=temp['count'])])
fig.update_layout(title_text = "Total Number of 311 Parking Requests per Day of the Week")
fig.show()
As seen in the boxplot below, extreme outliers obscure any differences in response time across days. To work around this, I also display a boxplot of the hourly mean response times for each day of the week.
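Instead of a hard cutoff like the 10,000-minute filter used in the first boxplot, Tukey's 1.5 × IQR rule is a common alternative for flagging outliers; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for mins_to_resolve, with two extreme outliers appended
rng = np.random.default_rng(2)
mins = pd.Series(np.concatenate([rng.exponential(120, 1000), [50_000, 90_000]]))

# Tukey's rule: flag points beyond 1.5 * IQR above the third quartile
q1, q3 = mins.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
trimmed = mins[mins <= upper]
print(len(mins) - len(trimmed), "outliers removed; max is now", trimmed.max())
```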
fig = px.box(parking_df.query("mins_to_resolve < 10000"), x="week_day", y="mins_to_resolve", points="all")
fig.update_layout(title_text="Distribution of Response Times per Day of the Week (values above 10,000 minutes excluded)")
fig.show()
trend_avg['week_day'] = trend_avg['week_day'].astype(cat_type)
trend_avg.sort_values(by='week_day', inplace=True)
fig = px.box(trend_avg, x="week_day", y="avg_mins_to_resolve", points="all")
fig.update_layout(title_text="Distribution of Mean Response Times per Day of the Week")
fig.show()
sns.set(color_codes=True)
g = sns.lmplot(x="hour",y="avg_mins_to_resolve",data=trend_avg, height=4, aspect=.8,col_wrap=4,
col="week_day", hue="week_day")
g.fig.subplots_adjust(top=.9)
g.fig.suptitle("Relationship Between Hour of the Day and Mean Response Times per Day of Week",
fontsize=20)
Numeric exploratory data analysis provides a general understanding of the distribution of the data. By interpreting the mean, median, and standard deviation, we gain insight into the underlying skew or bias of the dataset. Visual EDA expands on numeric EDA, highlighting potential explanations for the dataset's distribution, such as clustering around categorical variables. For example, say we're conducting EDA on an NYC Subway travel-times dataset. Numeric analysis can tell us the average wait time at specific stations, while visual EDA can reveal clustering of average wait times by subway line.
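The hypothetical subway example above can be sketched as a one-line groupby -- the station names and wait times below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical subway wait-time records (illustrative only)
waits = pd.DataFrame({
    "line": ["A", "A", "L", "L", "7", "7"],
    "station": ["14 St", "59 St", "Bedford Av", "1 Av", "Mets-Willets Pt", "Hudson Yds"],
    "wait_mins": [6.0, 7.0, 3.0, 3.5, 10.0, 11.0],
})

# Numeric EDA: one summary number per line; a boxplot of the same
# grouping would be the visual-EDA analogue
per_line = waits.groupby("line")["wait_mins"].mean()
print(per_line)
```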
(1) Beyond the fact that it's a pie chart, the "bad Microsoft" visualization incorrectly conflates percentages of a whole with total counts. Is this chart supposed to represent every single feature ever added to Word? If so, does the latest version contain all of the features? (2) The y-axis scale is so small that any increase is effectively still 0, so displaying any significant change in "flying saucers", "extraterrestrials", or "alien abduction" is misleading. It's also unclear what the y-axis represents -- percentage of what?
(1) John Snow's cholera map is an excellent data visualization because it clearly reveals a pattern: there are specific areas where far more incidents have occurred than anywhere else. This visualization enables anyone conducting EDA to ask further questions, such as: what is it about this particular area that leads to more incidents? Well, there's a nearby water pump that may be the source of the disease. (2) The second "good" visualization shows the change in population growth among European countries. The contrasting yellow, white, and blue make it immediately clear which countries (and counties) are increasing, decreasing, or maintaining population growth.
We should use EDA at the initial stages of any project, prior to diving into the dataset and building our models. EDA enables us to understand the quality of the underlying data -- if we're operating with a dataset that is severely skewed, or is missing a majority of its rows, the project could yield uninformative or even misleading results. Having a basic understanding of patterns that exist within the data can help inform the research process, either prioritizing ideas to explore further or identifying new questions.
John Tukey saw "confirmatory" versus "exploratory" analysis as a tension between the desire to employ tools to answer a question and the power of knowing what question to ask. Tukey viewed EDA as a combination of attitude, flexibility, and transparency (aka some graph paper). It requires a willingness to explore the data prior to employing tools to test preconceived assumptions. According to Tukey, an example of EDA is "picture-examining", since "the eye is the best finder we have of the unanticipated." Exploratory analysis is akin to detective work -- gathering and reviewing evidence in an attempt to identify patterns within the data. Once we've established our question, Tukey viewed "confirmatory" analysis as the toolkit (significance tests, inference, etc.) required to test our hypothesis.
%%html
<img src="bad_microsoft.png" width=700 height=700>
%%html
<img src="ngram-alien-abduct.png" width=700 height=700>
%%html
<img src="John-Snows-Cholera-map.jpg" width=700 height=700>
%%html
<img src="population_change.jpg" width=700 height=700>